ACCEPTED PAPERS
- 11/30/2022
Paper ID | Paper Title | Authors |
9 | Transduce and Speak: Neural Transducer for Text-To-Speech with Semantic Token Prediction | Minchan Kim (Seoul National University)*; Myeonghun Jeong (Seoul National University); Byoung Jin Choi (Seoul National University); Dongjune Lee (Seoul National University); Nam Soo Kim (Seoul National University) |
11 | Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-To-Sequence End-To-End Spoken Language Understanding | Pavel Denisov (University of Stuttgart)*; Ngoc Thang Vu (University of Stuttgart) |
12 | LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models | Chi-Chang Lee (Academia Sinica)*; Hong Wei Chen (National Taiwan University); Chu-Song Chen (National Taiwan University); Hsin-Min Wang (Academia Sinica); Tsung-Te Liu (National Taiwan University); Yu Tsao (Academia Sinica) |
22 | Variational Gaussian Process Data Uncertainty | Jeremy H. M. Wong (Institute for Infocomm Research)*; Huayun Zhang (ASTAR ); Nancy Chen (Institute for Infocomm Research) |
26 | Low-rank Adaptation of Neural Language Model Rescoring for Speech Recognition | Yu Yu (Stevens Institute of Technology); Chao-Han Huck Yang (Amazon)*; Jari T Kolehmainen (Amazon); Prashanth Gurunath Shivakumar (Amazon); Yile Gu (Amazon); Sungho Ryu (Amazon); Roger Ren (Amazon); Qi Luo (Amazon.com Inc.); Aditya Gourav (Amazon); I-Fan Chen (Amazon Inc.); Yi Chieh Liu (Amazon); Tuan Dinh (Amazon); Denis Filimonov (Amazon); Ankur Gandhe (Amazon Alexa); Andreas Stolcke (Amazon); Ariya Rastrow (Amazon Alexa); Ivan Bulyko (Amazon) |
27 | CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers | Xintong Wang (XiaoIce ); Chang Zeng (National Institute of Informatics)*; Jun Chen (Tsinghua University); wang chun hui (XiaoIce) |
30 | Active Learning Based Fine-Tuning Framework for Speech Emotion Recognition | Dongyuan Li (Tokyo Institute of Technology)*; Yusong Wang (Tokyo Institute of Technology); Kotaro Funakoshi (Tokyo Institute of Technology); Manabu Okumura (Tokyo Institute of Technology) |
32 | Identifying People with Mild Cognitive Impairment At Risk of Developing Dementia Using Speech Analysis | Bahman Mirheidari (University of Sheffield)*; Ronan O’Malley (University of Sheffield); Daniel Blackburn (University of Sheffield); Heidi Christensen (University of Sheffield) |
36 | BiSinger: Bilingual Singing Voice Synthesis | Huali Zhou (Wuhan University); Yueqian Lin (Duke Kunshan University); Yao Shi (Duke Kunshan University); Peng Sun (Duke Kunshan University); Ming Li (Duke Kunshan University)* |
44 | Robust Recognition of Speaker Emotion with Difference Feature Extraction Using a Few Enrollment Utterances | Daichi Hayakawa (Toshiba Corporation Corporate R&D Center)*; Takehiko Kagoshima (Toshiba Corporation Corporate R&D Center); Kenji Iwata (Toshiba Corporation Corporate R&D Center); Rama S Doddipatla (Toshiba Europe LTD); Norbert Braunschweiler (Toshiba Europe Limited) |
47 | Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking | Jihyun Lee (Pohang University of Science and Technology)*; Yejin Jeon (POSTECH); Wonjun Lee (POSTECH); Yunsu Kim (POSTECH); Gary Geunbae Lee (Postech) |
48 | Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition | Yuang Li (University of Cambridge)*; Yu Wu (Microsoft Research Asia); Jinyu Li (Microsoft); Shujie Liu (Microsoft Research Asia) |
50 | MBTFNet: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement | Weiming Xu (Northwestern Polytechnical University)*; Xuanzhou Chen (Lyra Lab, Tencent Music Entertainment, Shenzhen, China); Zhili Tan (Tencent); Shubo Lv (Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University); Runduo Han (Northwestern Polytechnical University); Wenjiang Zhou (Lyra Lab, Tencent Music Entertainment, Shenzhen, China); Weifeng Zhao (Lyra Lab, Tencent Music Entertainment, Shenzhen, China); Lei Xie (NWPU) |
51 | Can We Use Speaker Embeddings on Spontaneous Speech Obtained From Medical Conversations to Predict Intelligibility? | Sebastião Quintas (IRIT, Université de Toulouse, CNRS, Toulouse, France)*; Mathieu Balaguer (IRIT); Julie Mauclair (IRIT); Virginie Woisard (Hospitals of Toulouse); Julien Pinquier (IRIT) |
53 | End-to-End Training of a Neural HMM with Label and Transition Probabilities | Daniel Mann (RWTH Aachen University)*; Tina Raissi (RWTH Aachen University); Wilfried Michel (AppTek); Ralf Schlüter (RWTH Aachen University); Hermann Ney ( RWTH Aachen University) |
54 | Wiki-En-ASR-Adapt: Large-Scale Synthetic Dataset for English ASR Customization | Alexandra A Antonova (Moscow Institute of Physics and Technology)* |
58 | The Role of Feature Correlation on Quantized Neural Networks | David Qiu (Google)*; Shaojin Ding (Google); Yanzhang He (Google) |
59 | LV-CTC: Non-autoregressive ASR With CTC and Latent Variable Models | Yuya Fujita (Yahoo Japan Corporation)*; Shinji Watanabe (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Takashi Maekaku (Yahoo Japan Corporation) |
64 | The Singing Voice Conversion Challenge 2023 | Wen-Chin Huang (Nagoya University)*; Lester Phillip G Violeta (Nagoya University); Songxiang Liu (Tencent); Jiatong Shi (Carnegie Mellon University); Tomoki Toda (Nagoya University) |
65 | Improving Multilingual and Code-switching ASR using Large Language Model Generated Text | Ke Hu (Google)*; Tara Sainath (Google); Bo Li (Google); Yu Zhang (Google); Yong Cheng (Google); Tao Wang (Google Inc.); Yujing Zhang (Google); Frederick Liu (Google Inc.) |
66 | Pareto Efficiency of Learning-Forgetting Trade-Off in Neural Language Model Adaptation | Jerome R Bellegarda (Apple)* |
72 | Improved Multi-modal Emotion Recognition using Squeeze-and-Excitation Block in Cross-Modal Attention | Junchen Liu (The University of Auckland)*; Jesin James (The University of Auckland); Karan Nathwani (Indian Institute of Technology, Jammu) |
73 | Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer | Jin Qiu (ByteDance); Lu Huang (ByteDance)*; Boyu Li (ByteDance); Jun Zhang (Bytedance); Lu Lu (Bytedance); Zejun Ma (Bytedance) |
76 | Locality Enhanced Dynamic Biasing and Sampling Strategies for Contextual ASR | Md Asif Jalal (Samsung Research UK)*; Pablo Peso Parada (Samsung Research UK); George Pavlidis (Information Technologies Institute, Centre for Research and Technology); Vasileios Moschopoulos (Information Technologies Institute, Centre for Research and Technology - Hellas, Thessaloniki, Greece); KARTHIKEYAN SARAVANAN (Samsung Research, UK); Chrysovalantis G Kontoulis (Pragma-IoT); Jisi Zhang (Samsung Research UK); Anastasios Drosou (Information Technologies Institute, Centre for Research and Technology - Hellas, Thessaloniki, Greece); Jung In Lee (Samsung Electronics); Gil Ho Lee (Samsung Electronics); Seokyeong Jung (Samsung Electronics) |
77 | Robust End-To-End Diarization with Domain Adaptive Training and Multi-Task Learning | Ivan Fung (Fano Labs)*; Lahiru T Samarakoon (Fano Labs, Hong Kong); Samuel J Broughton (Fano Labs) |
78 | Whisper-SLU: Extending a Pretrained Speech-to-Text Transformer for Low Resource Spoken Language Understanding | Quentin Meeus (KU Leuven)*; Sien Moens (KU Leuven); Hugo Van hamme (KU Leuven) |
80 | Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that leverages a Universal Speech Model | Hagen Soltau (Google)*; Izhak Shafran (Google AI); Alex Ottenwess (Google); Joseph R. JR Duffy (Mayo Clinic); Rene L Utianski (Mayo Clinic); Leland R. Barnard (Mayo); John L. Stricker (Mayo Clinic); Daniela Wiepert (Mayo Clinic); David T. Jones (Mayo Clinic); Hugo Botha (Mayo Clinic) |
82 | Contextual Spelling Correction With Large Language Models | Gan Song (Google)*; Zelin Wu (Google LLC); Golan Pundak (Google); Angad Chandorkar (Google); Xavier Velez (Google); Diamantino Caseiro (Google); Ben Haynor (Google); Weiran Wang (Google); Nikhil Siddhartha (Google); Kandarp Joshi (Google); Pat Rondon (Google); Khe C Sim (Google Inc.) |
83 | Not All Errors Are Created Equal: Evaluating The Impact of Model And Speaker Factors on ASR Outcomes in Clinical Populations | Daniela Wiepert (Mayo Clinic)*; Rene L Utianski (Mayo Clinic); Joseph Duffy (Mayo Clinic); John Stricker (Mayo Clinic); Leland Barnard (Mayo Clinic); Keith Josephs (Mayo Clinic Rochester); Jennifer Whitwell (Mayo Clinic Rochester); David Jones (Mayo Clinic); Hugo Botha (Mayo Clinic) |
85 | The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning | Lillian Zhou (Google)*; Yuxin Ding (Google); Mingqing Chen (Google Inc.); Harry Zhang (Google); Rohit Prabhavalkar (Google); Dhruv Guliani (Google); Giovanni Motta (Google, Inc.); Rajiv Mathews (Google) |
90 | FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition | Dongning Yang (Shanghai Jiao Tong University)*; wei wang (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University) |
96 | Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition | Dongji Gao (Johns Hopkins University)*; Hainan Xu (NVIDIA); Desh Raj (Johns Hopkins University); Paola Garcia (Johns Hopkins University); Daniel Povey (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University) |
97 | Acoustics-Text Dual-Modal Joint Representation Learning for Cover Song Identification | Yanmei Gu (AntGroup)*; Li Jing (AntGroup); Zhou Jiayi (AntGroup); Wang Zhiming (AntGroup); Zhu Huijia (AntGroup) |
99 | Towards Matching Phones and Speech Representations | Gene-Ping Yang (The University of Edinburgh)*; Hao Tang (The University of Edinburgh) |
101 | RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain | Sangeet Sagar (Saarland University )*; Mirco Ravanelli (Université de Montréal); Bernd Kiefer (DFKI); Ivana Kruijff (DFKI); Josef van Genabith (Saarland University) |
103 | Mask-Conformer: Augmenting Conformer with Mask-Predict Decoder | Yosuke Higuchi (Waseda University)*; Andrew Rosenberg (Google LLC); Yuan Wang (Google); Murali Karthick Baskar (Google Inc); Bhuvana Ramabhadran (Google) |
105 | ED-CEC: Improving Rare Word Recognition Using ASR Post-Processing Based on Error Detection and Context-Aware Error Correction | Jiajun He (Nagoya University)*; Zekun Yang (Nagoya University); Tomoki Toda (Nagoya University) |
108 | Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning | guanrou yang (Shanghai Jiao Tong University)*; Xie Chen (Shanghai Jiaotong University); Ziyang Ma (Shanghai Jiao Tong University); Zhisheng Zheng (Shanghai Jiao Tong University ); Yakun Song (Shanghai Jiao Tong University); Zhikang Niu (Xidian University) |
109 | Can Unpaired Textual Data Replace Synthetic Speech In ASR Model Adaptation? | Pasquale D'Alterio (Amazon)*; Christian Hensel (Amazon); Bashar Awwad Shiekh Hasan (Amazon) |
110 | Preserving Phonemic Distinctions For Ordinal Regression: A Novel Loss Function For Automatic Pronunciation Assessment | Bi-Cheng Yan (National Taiwan Normal University )*; Hsin-Wei Wang (NTNU); Yi-Cheng Wang (National Taiwan Normal University); Jiun-Ting Li (National Taiwan Normal University); Chi-Han Lin (E.SUN Financial Holding Co., Ltd.); Berlin Chen (National Taiwan Normal University) |
111 | Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition | Yujin Wang (Tsinghua University); Changli Tang (Tsinghua University)*; Ziyang Ma (Shanghai Jiao Tong University); Zhisheng Zheng (Shanghai Jiao Tong University ); Xie Chen (Shanghai Jiaotong University); Wei-Qiang Zhang (Tsinghua University) |
112 | Efficient Cascaded Streaming ASR System via Frame Rate Reduction | Xingyu Cai (Google)*; David Qiu (Google); Shaojin Ding (Google); Dongseong Hwang (Google); Weiran Wang (Google); Antoine Bruguier (Google); Rohit Prabhavalkar (Google); Tara Sainath (Google); Yanzhang He (Google) |
119 | VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention | Yuewei Zhang (Shanghai Jiao Tong University)*; Huanbin Zou (Tencent); jie zhu (Shanghai Jiao Tong University) |
123 | Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization | Wei-Ping Huang (National Taiwan University)*; Sung-Feng Huang (National Taiwan University); Hung-yi Lee (National Taiwan University) |
124 | Meta-Learning Framework for End-To-End Imposter Identification in Unseen Speaker Recognition | Ashutosh Chaubey (LG Ad Solutions); Sparsh Sinha (LG Ad Solutions)*; Susmita Ghose (LG Ad Solutions) |
127 | Using Joint Training Speaker Encoder with Consistency Loss to Achieve Cross-Lingual Voice Conversion and Expressive Voice Conversion | Houjian Guo (Osaka University, Riken Guardian Robot Group); Chaoran Liu (Riken)*; Carlos T Ishi (RIKEN); Hiroshi Ishiguro (Osaka University) |
128 | QuickVC: A Lightweight VITS-Based Any-To-Many Voice Conversion Model Using iSTFT for Faster Conversion | Houjian Guo (Osaka University, Riken Guardian Robot Group); Chaoran Liu (Riken)*; Carlos T Ishi (RIKEN); Hiroshi Ishiguro (Osaka University) |
130 | Multi Transcription-Style Speech Transcription Using Attention-Based Encoder-Decoder Model | Yan Huang (Microsoft Research)*; Piyush Behre (Microsoft); Guoli Ye (Microsoft); Shawn Chang (); Yifan Gong (Microsoft) |
134 | NeuralKalman: A Learnable Kalman Filter for Acoustic Echo Cancellation | Yixuan Zhang (The Ohio State University)*; Meng Yu (Tencent); Hao Zhang (Tencent AI Lab); Dong Yu (Tencent AI Lab); DeLiang Wang (Ohio State University) |
135 | Thai-Dialect: Low Resource Thai Dialectal Speech to Text Corpora | Artit Suwanbandit (Chulalongkorn University)*; Jaturong Chitiyaphol (Khon Kaen University); Sutthinan Chuenchom (Chiang Mai Rajabhat University); Kanyarat Kwiecien (Khon Kaen University); Husen Sawal (Prince of Songkla University); Ruslan Uthai (Prince of Songkla University); Orathai Sangpetch (CMKL University); Ekapol Chuangsuwanich (Chulalongkorn University) |
140 | Deep Learning for Joint Acoustic Echo and Acoustic Howling Suppression in Hybrid Meetings | Hao Zhang (Tencent AI Lab)*; Meng Yu (Tencent); Dong Yu (Tencent AI Lab) |
142 | On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments | William Ravenscroft (The University of Sheffield)*; Stefan Goetze (University of Sheffield); Thomas Hain (University of Sheffield) |
143 | NeuralEcho: Hybrid of Full-Band and Sub-Band Recurrent Neural Network for Acoustic Echo Cancellation and Speech Enhancement | Meng Yu (Tencent)*; Yong Xu (Tencent); Chunlei zhang (Tencent AI Lab); Shixiong Zhang (Tencent AI Lab); Dong Yu (Tencent AI Lab) |
144 | Combining relative and absolute learning formulations to predict emotional attributes from speech | Abinay Reddy Naini (The University of Texas at Dallas); Shruthi Subramanium (The University of Texas at Dallas); Seong-Gyun Leem (University of Texas at Dallas); Carlos Busso (University of Texas at Dallas)* |
145 | ESPNet-SUMM: Introducing a novel large dataset, toolkit, and a cross-corpora evaluation of speech summarization systems | Roshan S Sharma (Carnegie Mellon University)*; William Chen (Carnegie Mellon University); Takatomo Kano (NTT Corporation); Ruchira S Sharma (University of Massachusetts, Amherst); Atsunori Ogawa (NTT Corporation); Siddhant Arora (Carnegie Mellon University); Marc Delcroix (NTT); Rita Singh (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Bhiksha Raj (Carnegie Mellon University) |
146 | Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data | Yifan Peng (Carnegie Mellon University)*; Jinchuan Tian (Carnegie Mellon University); Brian Yan (Carnegie Mellon University); Dan Berrebbi (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Xinjian Li (Carnegie Mellon University); Jiatong Shi (Carnegie Mellon University); Siddhant Arora (Carnegie Mellon University); William Chen (Carnegie Mellon University); Roshan S Sharma (Carnegie Mellon University); Wangyou Zhang (Shanghai Jiao Tong University); Yui Sudo (Honda Research Institute Japan); Muhammad Mr. Shakeel (Honda Research Institute Japan); Jee-weon Jung (Carnegie Mellon University); Soumi Maiti (CMU); Shinji Watanabe (Carnegie Mellon University) |
155 | Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference | Masao Someki (IBM)*; Nicholas Eng (The University of Auckland); Yosuke Higuchi (Waseda University); Shinji Watanabe (Carnegie Mellon University) |
156 | Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning | William Chen (Carnegie Mellon University)*; Jiatong Shi (Carnegie Mellon University); Brian Yan (Carnegie Mellon University); Dan Berrebbi (Carnegie Mellon University); Wangyou Zhang (Shanghai Jiao Tong University); Yifan Peng (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Soumi Maiti (CMU); Shinji Watanabe (Carnegie Mellon University) |
163 | Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond | Jiatong Shi (Carnegie Mellon University)*; William Chen (Carnegie Mellon University); Dan Berrebbi (Carnegie Mellon University); Hsiu-Hsuan Wang (National Taiwan University ); Wei Ping Huang (National Taiwan University); En Pei Hu (National Taiwan University); ho lam Chung (National Taiwan University); Xuankai Chang (Carnegie Mellon University); Yuxun Tang (Renmin University of China); Shang-Wen Li (Meta AI); Abdelrahman Mohamed (Rembrand Inc); Hung-yi Lee (National Taiwan University); Shinji Watanabe (Carnegie Mellon University) |
165 | Diffusion-Based Mel-Spectrogram Enhancement For Personalized Speech Synthesis With Found Data | Yusheng Tian (The Chinese University of Hong Kong)*; Wei Liu (The Chinese University of Hong Kong); Tan Lee (The Chinese University of Hong Kong) |
166 | SQAT-LD: Speech Quality Assessment Transformer Utilizing Listener Dependent Modeling For Zero-Shot Out-Of-Domain MOS Prediction | Kailai Shen (Ningbo University); Diqun Yan (Ningbo University)*; Li Dong (Ningbo University); Ren Ying (Ningbo University); Xiaoxun Wu (Ningbo University); Jing Hu (Ningbo University) |
170 | Scenario-Aware Audio-Visual TF-GridNet For Target Speech Extraction | Zexu Pan (National University of Singapore)*; Gordon Wichern (Mitsubishi Electric Research Laboratories (MERL)); Yoshiki Masuyama (Tokyo Metropolitan University); François G Germain (Mitsubishi Electric Research Laboratories (MERL)); Sameer Khurana (Mitsubishi Electric Research Lab); Chiori Hori (Mitsubishi Electric Research Laboratories (MERL)); Jonathan LeRoux (Mitsubishi Electric Research Laboratories (MERL)) |
171 | Generative ASR Error Correction With Large Language Models | Chao-Han Huck Yang (Amazon)*; Yile Gu (Amazon.com, USA); Yi-Chieh Liu (Georgia Institute of Technology); Shalini Ghosh (Amazon Alexa AI); Ivan Bulyko (Amazon); Andreas Stolcke (Amazon) |
172 | Enhancing Task-Oriented Dialogues With Chitchat: A Comparative Study Based On Lexical Diversity And Divergence | Armand Stricker (LISN, CNRS)*; Patrick Paroubek (LISN) |
174 | Token-Level Serialized Output Training For Joint Streaming ASR And ST Leveraging Textual Alignments | Sara Papi (FBK)*; Peidong Wang (Microsoft); Junkun Chen (Microsoft); JIAN XUE (Microsoft Corporation); Jinyu Li (Microsoft); Yashesh Gaur (Microsoft) |
175 | LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task For E2E Code-Switching ASR | Guodong Ma (NetEase Yidun AI Lab)*; Wenxuan Wang (NetEase Yidun AI Lab); Yuke Li (NetEase Yidun AI Lab); Yuting Yang (NetEase Yidun AI Lab); Binbin Du (NetEase Yidun AI Lab); Haoran Fu (Department of Civil Engineering, Zhejiang University) |
177 | A Token-Wise Beam Search Algorithm For RNN-T | Gil Keren (Facebook)* |
181 | Joint Federated Learning And Personalization For On-Device ASR | Junteng Jia (Meta AI)*; Ke Li (Johns Hopkins University); Mani Malek (Meta); Kshitiz Malik (Meta); Jay Mahadeokar (Meta AI); Ozlem Kalinli (Meta); Frank Seide (Meta AI) |
183 | MelHuBERT: A Simplified HuBERT On Mel Spectrograms | Tzu-Quan Lin (National Taiwan University)*; Hung-yi Lee (National Taiwan University); Hao Tang (The University of Edinburgh) |
185 | Exploring Data Augmentation In Bias Mitigation Against Non-Native-Accented Speech | Yuanyuan Zhang (Technische Universiteit Delft)*; Aaricia Herygers (-); Tanvina Patel (Multimedia computing, Delft University of Technology ); Zhengjun Yue (Technische Universiteit Delft); Odette Scharenborg (Multimedia Computing Group, Delft University of Technology) |
190 | AWMC: Online Test-Time Adaptation Without Mode Collapse For Continual Adaptation | Jae-Hong Lee (Hanyang University)*; Dohee Kim (Hanyang University); Joon-Hyuk Chang (Hanyang University) |
192 | LE-SSL-MOS: Self-Supervised Learning MOS Prediction With Listener Enhancement | Zili Qi (Hithink RoyalFlush AI Research Institute)*; Xinhui Hu (Hithink RoyalFlush AI Research Institute); Wangjin Zhou (Kyoto University); Sheng Li (National Institute of Information & Communications Technology (NICT)); Hao Wu (Hithink RoyalFlush AI Research Institute); Jian Lu (Hithink RoyalFlush AI Research Institute); Xinkang Xu (Hithink RoyalFlush AI Research Institute) |
198 | Transcribing And Aligning Conversational Speech: A Hybrid Pipeline Applied To French Conversations | Hiroyoshi Yamasaki (Aix-Marseille University); Jérôme Louradour (Linagora); Julie Hunter (LINAGORA); Laurent Prevot (Aix Marseille Université & CNRS)* |
201 | FedCPC: An Effective Federated Contrastive Learning Method For Privacy Preserving Early-Stage Alzheimer’s Speech Detection | wenqing wei (Japan Advanced Institute of Science and Technology); Zhengdong Yang (Kyoto University); Gao Yuan (Japan Advanced Institute of Science and Technology); Jiyi Li (University of Yamanashi); Chenhui Chu (Kyoto University); Shogo Okada (Japan Advanced Institute of Science and Technology); Sheng Li (National Institute of Information & Communications Technology (NICT))* |
204 | Toward General-Purpose Text-Instruction-Guided Voice Conversion | Chun-Yi Kuan (National Taiwan University)*; Chen An Li (National Taiwan University); Tsu-Yuan Hsu (National Taiwan University); Tse-Yang Lin (National Taiwan University); ho lam Chung (National Taiwan University); Kai-Wei Chang (National Taiwan University); Shuo-yiin Chang (Google); Hung-yi Lee (National Taiwan University) |
206 | AV-data2vec: Self-Supervised Learning Of Audio-Visual Speech Representations With Contextualized Target Representations | Jiachen Lian (University of California Berkeley)*; Alexei Baevski (Facebook AI Research); Wei-Ning Hsu (Meta); Michael Auli (Meta) |
207 | Improving Stability In Simultaneous Speech Translation: A Revision-Controllable Decoding Approach | Junkun Chen (Microsoft)*; JIAN XUE (Microsoft Corporation); Peidong Wang (Microsoft); Jing Pan (Microsoft); Jinyu Li (Microsoft) |
211 | Haha-Pod: An Attempt For Laughter-Based Non-Verbal Speaker Verification | Yuke Lin (Wuhan University); Xiaoyi Qin (Duke Kunshan University); Ning Jiang (Mashang Consumer Finance Co., Ltd.); Guoqing Zhao (Mashang Consumer Finance Co., Ltd); Ming Li (Duke Kunshan University)* |
220 | PP-MeT: A Real-World Personalized Prompt Based Meeting Transcription System | xiang lyu (ximalaya)*; Yuhang Cao (ximalaya); qing wang (ximalaya); Jingjing Yin (Ximalaya); Yuguang Yang (Ximalaya Inc., ShangHai, China); pengpeng zou (ximalaya); xuecheng hu (ximalaya); yanni hu (ximalaya); heng lu (ximalaya) |
223 | Brouhaha: Multi-Task Training For Voice Activity Detection, Speech-To-Noise Ratio, And C50 Room Acoustics Estimation | Marvin Lavechin (ENS, Meta AI)*; Marianne Metais (ENS); Hadrien Titeux (ENS); Alodie Boissonnet (Meta AI); Jade Copet (Meta AI); Morgane Riviere (Meta AI); Elika Bergelson (Duke University); Alejandrina Cristia (Exelang, CNRS, LSCP); Emmanuel Dupoux (EHESS, ENS, PSL University, CNRS, INRIA, META); Hervé Bredin (CNRS) |
224 | Magnitude-And-Phase-Aware Speech Enhancement With Parallel Sequence Modeling | Yuewei Zhang (Shanghai Jiao Tong University)*; Huanbin Zou (Tencent); jie zhu (Shanghai Jiao Tong University) |
228 | Speaker Adaptation For End-To-End Speech Recognition Systems In Noisy Environments | Dominik Wagner (Technische Hochschule Nuernberg Georg Simon Ohm)*; Ilja Baumann (Technische Hochschule Nürnberg Georg Simon Ohm); Sebastian P Bayerl (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm); Tobias Bocklet (TH Nürnberg ) |
233 | Improving Severity Preservation of Healthy-To-Pathological Voice Conversion With Global Style Tokens | Bence Halpern (Netherlands Cancer Institute)*; Wen-Chin Huang (Nagoya University); Lester Phillip G Violeta (Nagoya University); Rob van Son (Netherlands Cancer Institute); Tomoki Toda (Nagoya University) |
235 | End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis | Can Cui (Inria)*; Imran Sheikh (Vivoka); Mostafa Sadeghi (INRIA); Emmanuel Vincent (Inria) |
238 | GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition | Daniel Galvez (NVIDIA)*; Tim Kaldewey (NVIDIA) |
239 | Audio-AdapterFusion: A Task-ID-free Approach for Efficient and Non-Destructive Multi-task Speech Recognition | Hillary Ngai (Google)*; Rohan Agrawal (Google); Parisa Haghani (Google); Pedro J Moreno (Google); W. Ronny Huang (Google); Neeraj Gaur (Google) |
243 | CAMSAT: Augmentation Mix and Self-Augmented Training Clustering for Self-Supervised Speaker Recognition | Abderrahim Fathan (Computer Research Institute of Montreal (CRIM), Montreal, Quebec, Canada)*; Jahangir Alam (Computer Research Institute of Montreal (CRIM), Montreal (Quebec) Canada) |
244 | Toward Universal Speech Enhancement For Diverse Input Conditions | Wangyou Zhang (Shanghai Jiao Tong University)*; Kohei Saijo (Waseda University); Zhong-Qiu Wang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Yanmin Qian (Shanghai Jiao Tong University) |
245 | Adversarial Augmentation for Adapter Learning | Jen-Tzung Chien (National Yang Ming Chiao Tung University)*; Wei-Yu Sun (National Yang Ming Chiao Tung University) |
246 | Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition And Phoneme To Grapheme Translation | Wonjun Lee (POSTECH)*; Yunsu Kim (POSTECH); Gary Geunbae Lee (Postech) |
248 | CTC Blank Triggered Dynamic Layer-Skipping For Efficient CTC-Based Speech Recognition | Junfeng Hou (Netease)*; Peiyao Wang (Netease); Jincheng Zhang (Netease); Meng Yang (Netease); Minwei Feng (Netease); Jingcheng Yin (Netease) |
252 | Prompt Pool Based Class-Incremental Continual Learning for Dialog State Tracking | Hong Liu (Tsinghua University)*; Yucheng Cai (tsinghua university); Yuan Zhou (None); Zhijian Ou (Tsinghua University); Yi Huang (China Mobile Research); Junlan Feng (China Mobile Research) |
253 | Model-based Fairness Metric for Speaker Verification | Maliha Jahan (Johns Hopkins University)*; Laureano Moro-Velazquez (Johns Hopkins University); Thomas Thebaud (Johns Hopkins University); Najim Dehak (Johns Hopkins University); Jesus Antonio Villalba (Johns Hopkins University) |
258 | The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains | Erica Cooper (National Institute of Informatics)*; Wen-Chin Huang (Nagoya University); Yu Tsao (Academia Sinica); Hsin-Min Wang (Academia Sinica); Tomoki Toda (Nagoya University); Junichi Yamagishi (National Institute of Informatics) |
263 | Cross-Modal Alignment with Optimal Transport for CTC-Based ASR | Xugang Lu (NICT)*; Peng Shen (NICT); Yu Tsao (Academia Sinica); Hisashi Kawai (NICT) |
264 | Study on the Correlation between Objective Evaluations and Subjective Speech Quality and Intelligibility | Hsin-Tien Chiang (Academia Sinica); Kuo-Hsuan Hung (Academia Sinica); Szu-Wei Fu (NVIDIA); Heng-Cheng Kuo (Academia Sinica); Ming-Hsueh Tsai (National Academy for Educational Research ); Yu Tsao (Academia Sinica)* |
265 | Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech Model | Kai-Wei Chang (National Taiwan University)*; Ming-Hsin Chen (National Taiwan University); Yun-Ping Lin (National Taiwan University); Jing Neng Hsu (National Taiwan University); Paul KM Huang (NTU); Chien-yu Huang (National Taiwan University); Shang-Wen Li (FAIR); Hung-yi Lee (National Taiwan University) |
266 | VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model | Yayun He (Ping An Technology (Shenzhen) Co., Ltd)*; Zuheng Kang (Ping An Technology (Shenzhen) Co., Ltd); Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd); Junqing Peng (Ping An Technology (Shenzhen) Co., Ltd); Jing Xiao (Ping An Insurance (Group) Company of China) |
268 | Zero-Shot Singing Voice Synthesis From Musical Score | Jun-You Wang (National Taiwan University)*; Hung-yi Lee (National Taiwan University); Roger Jang (); Li Su (Academia Sinica) |
271 | PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models | Robin Netzorg (UC Berkeley)*; Ajil Jalal (UC Berkeley ); Luna McNulty (Brown University); Gopala Krishna Anumanchipalli (UC Berkeley) |
272 | Boosting Modality Representation with Pre-trained Models and Multi-task Training for Multimodal Sentiment Analysis | Jiarui Hai (Johns Hopkins University)*; Yu-Jeh Liu (Johns Hopkins University); Mounya Elhilali (Johns Hopkins University) |
274 | Efficient Text-Only Domain Adaptation for CTC-based ASR | Chang Chen (Shanghai Jiao Tong University); Xun Gong (Shanghai Jiaotong University)*; Yanmin Qian (Shanghai Jiao Tong University) |
275 | Adapting Pretrained Speech Model For Mandarin Lyrics Transcription And Alignment | Jun-You Wang (National Taiwan University)*; Chon In Leong (National Taiwan University); Yu-Chen Lin (National Taiwan University); Li Su (Academia Sinica); Roger Jang () |
276 | Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-Supervised Setting | Hemant Yadav (IIIT Delhi)*; Erica Cooper (National Institute of Informatics); Junichi Yamagishi (National Institute of Informatics); Sunayana Sitaram (Microsoft Research); Rajiv Ratn Shah (IIIT Delhi) |
282 | COCO-NUT: Corpus Of Japanese Utterance And Voice Characteristics Description For Prompt-Based Control | Aya Watanabe (The University of Tokyo)*; Shinnosuke Takamichi (The University of Tokyo); Yuki Saito (The University of Tokyo, Japan); Wataru Nakata (The University of Tokyo); Detai Xin (The University of Tokyo); Hiroshi Saruwatari (The University of Tokyo) |
284 | Generative Linguistic Representation For Spoken Language Identification | Peng Shen (NICT)*; Xugang Lu (NICT); Hisashi Kawai (NICT) |
287 | Spike-Triggered Contextual Biasing For End-To-End Mandarin Speech Recognition | Kaixun Huang (NWPU)*; Ao Zhang (Northwestern Polytechnical University); Binbin Zhang (Horizon Robotics); Tianyi Xu (NWPU); Xingchen Song (Tsinghua University); Lei Xie (NWPU) |
290 | Towards Robust Packet Loss Concealment System With ASR-Guided Representations | Da-Hee Yang (Hanyang University); Joon-Hyuk Chang (Hanyang University)* |
294 | U2-KWS: Unified Two-Pass Open-Vocabulary Keyword Spotting With Keyword Bias | Ao Zhang (Northwestern Polytechnical University)*; Pan Zhou (Li Auto Inc.); Kaixun Huang (NWPU); Yong Zou (Li Auto Inc.); Ming Liu (Li Auto Inc.); Lei Xie (NWPU) |
295 | Consistency Based Unsupervised Self-Training For ASR Personalisation | Jisi Zhang (Samsung Research UK)*; Vandana Rajan (Samsung Research UK); Haaris Mehmood (Samsung Research UK); David Tuckey (Samsung Research UK); Pablo Peso Parada (Samsung Research UK); Md Asif Jalal (Samsung Research UK); KARTHIKEYAN SARAVANAN (Samsung Research, UK); Gil Ho Lee (Samsung Electronics); Jung In Lee (Samsung Electronics); Seokyeong Jung (Samsung Electronics) |
299 | Towards A Unified End-To-End Language Understanding System For Speech And Text Inputs | Mohan LI (Toshiba Europe Ltd)*; Catalin Zorila (Toshiba Cambridge Research Lab); Cong-Thanh Do (Toshiba Research Europe Ltd.); Rama S Doddipatla (Toshiba Europe LTD) |
301 | On Decoder-only Architecture for Speech-to-text and Large Language Model Integration | Jian Wu (Microsoft)*; Yashesh Gaur (Microsoft); Zhuo Chen (Microsoft); Long Zhou (Microsoft Research Asia); Yimeng Zhu (Microsoft China); Tianrui Wang (Microsoft Research Asia); Jinyu Li (Microsoft); Shujie Liu (Microsoft Research Asia); Bo Ren (Microsoft); Linquan Liu (Microsoft China); Yu Wu (Microsoft Research Asia) |
302 | Paraconsistent Feature Analysis For the Competency Evaluation of Voice Impersonation | Rajeev Rajan (Government Engineering College, Barton Hill, Trivandrum)*; Noumida A (College Of Engineering Trivandrum); Sreelakshmi S (GOVERNMENT ENGINEERING COLLEGE, BARTON HILL) |
303 | Knowledge Distillation from Offline to Streaming Transducer: Toward Accurate and Fast Streaming Model by Matching Alignments | Ji-Hwan Mo (Hanyang University); Jae-Jin Jeon (Kakao Enterprise Corporation); MUNHAK LEE (Hanyang University); Joon-Hyuk Chang (Hanyang University)* |
304 | Transformer Attractors for Robust and Efficient End-to-end Neural Diarization | Lahiru T Samarakoon (Fano Labs, Hong Kong)*; Samuel J Broughton (Fano Labs); Marc Härkönen (Fano Labs); Ivan Fung (Fano Labs) |
308 | Detection of Vowel Errors in Children's Speech Using Synthetic Phonetic Transcripts | Ilja Baumann (Technische Hochschule Nürnberg Georg Simon Ohm)*; Dominik Wagner (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm); Elmar Noeth (Friedrich-Alexander-Universität Erlangen-Nürnberg); Tobias Bocklet (TH Nürnberg) |
312 | Invert-Classify: Recovering Discrete Prosody Inputs for Text-to-Speech | Nicholas J Sanders (University of Edinburgh)*; Korin Richmond (University of Edinburgh) |
313 | KAQ: A Non-Intrusive Stacking Framework for Mean Opinion Score Prediction with Multi-Task Learning | Chenglin Xu (Kuaishou Technology)*; Xiguang Zheng (北京达佳互联信息技术有限公司); Chen Zhang (北京达佳互联信息技术有限公司); Chao Zhou (Kuaishou Technology); Qi Huang (Kuaishou Technology); Bing Yu (Kuaishou Technology) |
316 | SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR | Yangze Li (Northwestern Polytechnical University)*; Fan Yu (Speech Lab of DAMO Academy, Alibaba Group); Yuhao Liang (Northwestern Polytechnical University); Pengcheng Guo (Northwestern Polytechnical University); Mohan Shi (University of Science and Technology of China); Zhihao Du (Speech Lab of DAMO Academy, Alibaba Group); Shiliang Zhang (Alibaba Group); Lei Xie (Northwestern Polytechnical University) |
318 | Simulation of Teacher-Learner Interaction in English Language Pronunciation Learning | Elaf Islam (The University of Sheffield)*; Thomas Hain (University of Sheffield); Protima Nomo Sudro (University of Sheffield) |
320 | Ending The Blind Flight: Analyzing The Impact of Acoustic And Lexical Factors on Wav2Vec 2.0 in Air-Traffic Control | Alexander Blatt (Saarland University)*; Badr Abdullah (Saarland University); Dietrich Klakow (Saarland University) |
323 | Cross-modal learning for CTC-based ASR: Leveraging CTC-BERTScore and sequence-level training | MUNHAK LEE (Hanyang University); Sang-Eon Lee (Hanyang University); Jieun Choi (Hanyang University); Joon-Hyuk Chang (Hanyang University)* |
324 | Clustering Unsupervised Representations As Defense Against Poisoning Attacks on Speech Commands Classification System | Thomas Thebaud (Johns Hopkins University)*; Sonal Joshi (Johns Hopkins University); Henry Li Xinyuan (Johns Hopkins University); Martin Sustek (Johns Hopkins University); Jesús Antonio Villalba López (Johns Hopkins University (JHU)); Sanjeev Khudanpur (Johns Hopkins University); Najim Dehak (Johns Hopkins University) |
326 | ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings | Jenthe Thienpondt (IDLab, Ghent University)*; Kris Demuynck (Ghent University) |
327 | Multitask Learning Model with Text And Speech Representation for Fine-Grained Speech Scoring | Seongjin Park (Educational Testing Service)*; Rutuja Ubale (Educational Testing Service Research) |
330 | LibriSpeech-PC: Benchmark For Evaluation Of Punctuation And Capitalization Capabilities Of End-To-End ASR Models | Aleksandr Meister (NVIDIA)*; Matvei Novikov (NVIDIA); Nikolay Karpov (NVIDIA); Evelina Bakhturina (NVIDIA); Vitaly Lavrukhin (NVIDIA); Boris Ginsburg (NVIDIA) |
331 | Evaluating Self-Supervised Speech Models on A Taiwanese Hokkien Corpus | Yi-Hui Chou (Carnegie Mellon University)*; Kalvin Chang (Carnegie Mellon University); Meng-Ju Wu (N/A); Winston Ou (Scripps College); Alice Wen-Hsin Bi (University of Maryland); Carol Yang (N/A); Bryan Y. Chen (Swarthmore College); Rong-Wei Pai (National Taiwan Normal University); Po-Yen Yeh (China Medical University, Taiwan); Jo-Peng Chiang (National Taiwan University); Iu-Tshiann Phoann (N/A); Winnie Chang (Carnegie Mellon University); Chenxuan Cui (Carnegie Mellon University); Noel Chen (Carnegie Mellon University); Jiatong Shi (Carnegie Mellon University) |
332 | Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition | Dima Rekesh (Nvidia)*; Nithin Rao Koluguri (NVIDIA); Samuel Kriman (NVIDIA); Somshubra Majumdar (NVIDIA); Vahid Noroozi (NVIDIA); He Huang (NVIDIA); Oleksii Hrinchuk (NVIDIA); Krishna C Puvvada (NVIDIA); Ankur Kumar (UCLA); Jagadeesh Balam (NVIDIA); Boris Ginsburg (NVIDIA) |
333 | Parameter-Efficient Cross-Language Transfer Learning For A Language-Modular Audiovisual Speech Recognition | Zhengyang Li (Technische Universität Carolo-Wilhelmina Braunschweig)*; Thomas Graave (Technische Universität Carolo-Wilhelmina Braunschweig); Jing Liu (Amazon.com); Timo Lohrenz (Technische Universität Carolo-Wilhelmina Braunschweig); Siegfried Kunzmann (Amazon.com); Tim Fingscheidt (Technische Universität Braunschweig) |
336 | Generalized Zero-Shot Audio-to-Intent Classification | Veera Raghavendra Elluru (AWS AI Labs)*; Devang Kulshreshtha (Amazon); Rohit Paturi (AWS AI Labs); Sravan Babu Bodapati (Amazon); Srikanth Ronanki (Amazon) |
337 | Investigating the Effect of Language Models in Sequence Discriminative Training for Neural Transducers | Zijian Yang (Lehrstuhl fuer Informatik 6, RWTH Aachen)*; Wei Zhou (Chair of Computer Science 6, RWTH Aachen University); Ralf Schlüter (RWTH Aachen University); Hermann Ney (RWTH Aachen University) |
339 | TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for PyTorch | Jeff Hwang (Meta)*; Moto Hira (Meta); Caroline Chen (Meta); Xiaohui Zhang (Meta); Zhaoheng Ni (Meta AI); Guangzhi Sun (University of Cambridge Department of Engineering); Pingchuan Ma (Meta); Ruizhe Huang (Johns Hopkins University); Vineel Pratap (Facebook); Yuekai Zhang (NVIDIA); Anurag Kumar (Facebook Reality Labs); Chin-Yun Yu (Queen Mary University of London); Chuang Zhu (NVIDIA); Chunxi Liu (Two Sigma); Jacob D Kahn (Facebook AI Research); Mirco Ravanelli (Université de Montréal); Peng Sun (NVIDIA); Shinji Watanabe (Carnegie Mellon University); Yangyang Shi (Facebook); Yumeng Tao (Meta) |
340 | Deriving Translational Acoustic Sub-Word Embeddings | Amit Meghanani (University of Sheffield)*; Thomas Hain (University of Sheffield) |
342 | A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability | JIAN XUE (Microsoft Corporation)*; Peidong Wang (Microsoft); Jinyu Li (Microsoft); eric sun (Microsoft) |
343 | Transferring Speech-Generic and Depression-Specific Knowledge for Alzheimer's Disease Detection | Ziyun Cui (Tsinghua University)*; Wen Wu (University of Cambridge); Chao Zhang (Tsinghua University); Wei-Qiang Zhang (Tsinghua University); Ji Wu (Tsinghua University) |
344 | Robust Logarithmic Champernowne Algorithm for Feedback Cancellation in Hearing Aids | Vanitha Devi R (National Institute of Technology Warangal)*; Vasundhara . (NIT Warangal) |
347 | Hierarchical Attention-Based Contextual Biasing for Personalized Speech Recognition Using Neural Transducers | Sibo Tong (Amazon)*; Philip Harding (Amazon Alexa); Simon Wiesler (Amazon) |
352 | E3 TTS: Easy End-To-End Diffusion-Based Text To Speech | Yuan Gao (Google)*; Nobuyuki Morioka (Google); Yu Zhang (Google); Nanxin Chen (Google) |
354 | Building High-Accuracy Multilingual ASR with Gated Language Experts And Curriculum Training | eric sun (Microsoft)*; Jinyu Li (Microsoft); Yuxuan Hu (Microsoft); Yimeng Zhu (Microsoft); Long Zhou (Microsoft Research Asia); JIAN XUE (Microsoft Corporation); Peidong Wang (Microsoft); Linquan Liu (Microsoft); Shujie Liu (Microsoft Research Asia); Ed C Lin (Microsoft); Yifan Gong (Microsoft) |
360 | FLAP: Fast Language-Audio Pre-Training | Ching-Feng Yeh (Facebook AI Research)*; Po-Yao Huang (Facebook AI Research); Vasu Sharma (Facebook AI Research); Shang-Wen Li (FAIR); Gargi Ghosh (Facebook AI Research) |
361 | On The Relevance Of Phoneme Duration Variability Of Synthesized Training Data For Automatic Speech Recognition | Nick Rossenbach (RWTH Aachen University / AppTek GmbH)*; Benedikt Hilmes (HLT); Ralf Schlüter (RWTH Aachen University) |
362 | Enabling Noisy Label Usage for Out-Of-Airspace Data in Read-Back Error Detection | Lakshmi Rajendram Bashyam (ZBW - Leibniz-Informationszentrum Wirtschaft); Alexander Blatt (Saarland University)*; Dietrich Klakow (Saarland University) |
367 | Enhancing Expressivity Transfer in Textless Speech-to-Speech Translation | jarod duret (LIA)*; Benjamin O'Brien (LIA - Avignon University); Yannick Estève (LIA - Avignon University); Titouan Parcollet (Samsung AI Research) |
368 | Dialect Adaptation and Data Augmentation for Low-Resource ASR: Team XYZ Systems for the MADASR 2023 Challenge | Tanel Alumae (Tallinn University of Technology)*; Jiaming Kong (Tallinn University of Technology); Daniil Robnikov (Tallinn University of Technology) |
370 | Reducing the Cost of Spoof Detection Labeling Using Mixed-Strategy Active Learning and Pretrained Models | Mark R Lindsey (Carnegie Mellon University)*; Nathaniel R Robinson (Carnegie Mellon University); Francis Kubala (Probity, Inc.); Richard M Stern (Carnegie Mellon University) |
373 | A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction | Kohei Saijo (Waseda University)*; Wangyou Zhang (Shanghai Jiao Tong University); Zhong-Qiu Wang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Tetsunori Kobayashi (Waseda University); Tetsuji Ogawa (Waseda University) |
375 | Two-Pass Endpoint Detection for Speech Recognition | Anirudh Raju (Amazon Alexa); Di He (Amazon); Aparna Khare (Amazon)*; Ilya Sklyar (Amazon); Long Chen (Amazon); Viet Anh Tranh (Amazon); Zhe Zhang (Amazon); Colin Vaz (Amazon); Sam Alptekin (Amazon); Venkatesh Ravichandran (Amazon); Roland Maas (Amazon Inc.); Ariya Rastrow (Amazon Alexa) |
379 | Improved Long-Form Speech Recognition by Jointly Modeling The Primary and Non-Primary Speakers | Guru Prakash Arumugam (Google LLC)*; Shuo-yiin Chang (Google); Tara Sainath (Google); Rohit Prabhavalkar (Google); Quan Wang (Google); Shaan Bijwadia (Google) |
380 | Exploring Time-Frequency Domain Target Speaker Extraction For Causal and Non-Causal Processing | Wangyou Zhang (Shanghai Jiao Tong University)*; Lei Yang (Samsung Research China – Beijing); Yanmin Qian (Shanghai Jiao Tong University) |
381 | Joint Energy-Based Model for Robust Speech Classification System against Dirty-Label Backdoor Poisoning Attacks | Martin Sustek (Brno University of Technology)*; Sonal Joshi (Johns Hopkins University); Henry Li Xinyuan (Johns Hopkins University); Thomas Thebaud (Johns Hopkins University); Jesus Antonio Villalba (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University); Najim Dehak (Johns Hopkins University) |
382 | Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-to-End ASR | Sheikh Shams Azam (Apple)*; Tatiana Likhomanenko (Apple); Martin Pelikan (Apple); Jan Silovsky (Apple) |
383 | Joint Audio and Speech Understanding | Yuan Gong (Massachusetts Institute of Technology)*; Alexander H Liu (MIT); Hongyin Luo (MIT); Leonid Karlinsky (IBM-Research); James Glass (Massachusetts Institute of Technology) |
385 | No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation | Dennis Fucci (Fondazione Bruno Kessler)*; Marco Gaido (Fondazione Bruno Kessler); Matteo Negri (Fondazione Bruno Kessler); Mauro Cettolo (Fondazione Bruno Kessler); Luisa Bentivogli (Fondazione Bruno Kessler) |
387 | Improving Audiovisual Active Speaker Detection in Egocentric Recordings with the Data-efficient Image Transformer | Jason Clarke (University of Sheffield)*; Yoshihiko Gotoh (University of Sheffield); Stefan Goetze (University of Sheffield) |
391 | YODAS: Youtube-Oriented Dataset for Audio and Speech | Xinjian Li (Carnegie Mellon University)*; Shinnosuke Takamichi (The University of Tokyo); Takaaki Saeki (The University of Tokyo); William Chen (Carnegie Mellon University); Sayaka Shiota (Tokyo Metropolitan University); Shinji Watanabe (Carnegie Mellon University) |
392 | Discriminative Speech Recognition Rescoring With Pre-trained Language Models | Prashanth Gurunath Shivakumar (Amazon)*; Jari T Kolehmainen (Amazon); Yile Gu (Amazon.com, USA); Ankur Gandhe (Amazon Alexa); Ariya Rastrow (Amazon Alexa); Ivan Bulyko (Amazon) |
394 | Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection | Jiachen Lian (University of California Berkeley)*; Carly Z Feng (University of California, Berkeley); Naasir S Farooqi (UC Berkeley); Steve Li (Berkeley Speech Group); Anshul P Kashyap (UC Berkeley); Cheol Jun Cho (UC Berkeley); Peter Wu (UC Berkeley); Robin Netzorg (UC Berkeley); Tingle Li (UC Berkeley); Gopala Krishna Anumanchipalli (UC Berkeley) |
395 | MiniSUPERB: Lightweight Benchmark for Self-Supervised Speech Models | Yu-Hsiang Wang (National Taiwan University)*; Huang-Yu Chen (National Taiwan University); Kai-Wei Chang (National Taiwan University); Winston H. Hsu (National Taiwan University); Hung-yi Lee (National Taiwan University) |
399 | MASR: Multi-Label Aware Speech Representation Learning | ANJALI RAJ (Google); Shikhar Bharadwaj (Google); Sriram Ganapathy (Google); Min Ma (Google Research); Shikhar Vashishth (Google)* |
403 | A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023 | Ryuichi Yamamoto (LINE Corp.)*; Reo Yoneyama (Nagoya University); Lester Phillip G Violeta (Nagoya University); Wen-Chin Huang (Nagoya University); Tomoki Toda (Nagoya University) |
404 | Extending Self-distilled Self-supervised Learning for Semi-supervised Speaker Verification | Jeong-Hwan Choi (Hanyang University); Jehyun Kyung (Hanyang University); Ju-seok Seong (Hanyang University); Ye-Rin Jeoung (Hanyang University); Joon-Hyuk Chang (Hanyang University)* |
406 | Pseudo-label based Supervised Contrastive Loss for Robust Speech Representations | Varun Krishna PS Krishna (Indian Institute of Science)*; Sriram Ganapathy (Indian Institute of Science, Bangalore, India, 560012) |
409 | Audio-Visual Neural Syntax Acquisition | Cheng-I Lai (MIT)*; Haoyue Shi (Toyota Technological Institute at Chicago); Puyuan Peng (The University of Texas at Austin); Yoon Kim (MIT); Kevin Gimpel (Toyota Technological Institute at Chicago); Shiyu Chang (UCSB); Yung-Sung Chuang (MIT); Saurabhchand Bhati (Johns Hopkins University); David Cox (MIT-IBM Watson AI Lab); David Harwath (The University of Texas at Austin); Yang Zhang (IBM T. J. Watson Research); Karen Livescu (TTI-Chicago); James Glass (Massachusetts Institute of Technology) |
411 | Improving Speech Enhancement Using Audio Tagging Knowledge From Pre-Trained Representations And Multi-Task Learning | Shaoxiong Lin (Shanghai Jiao Tong University); Chao Zhang (Tsinghua University); Yanmin Qian (Shanghai Jiao Tong University)* |
412 | BA-MoE: Boundary-Aware Mixture-Of-Experts Adapter For Code-Switching Speech Recognition | Peikun Chen (Northwestern Polytechnical University)*; Fan Yu (Speech Lab of DAMO Academy, Alibaba Group); Yuhao Liang (Northwestern Polytechnical University); Hongfei Xue (NWPU); Xuchen Wan (Huawei Technologies Co., Ltd.); Naijun Zheng (Huawei Technologies Co., Ltd.); zhou huan (AARC, Huawei Technologies Co., Ltd.); Lei Xie (Northwestern Polytechnical University) |
413 | Zero-shot Domain-sensitive Speech Recognition with Prompt-conditioning Fine-tuning | Yung-Chieh Chan (MediaTek Research); Feng-Ting Liao (MediaTek Research)*; Chan-Jan Hsu (MediaTek Research); Yi-Chang Chen (MediaTek Research); Da-shan Shiu (MediaTek Research) |
414 | Few-Shot Spoken Language Understanding via Joint Speech-Text Models | Chung-Ming Chien (Toyota Technological Institute at Chicago)*; Mingjiamei Zhang (University of Chicago); Ju-Chieh Chou (TTIC); Karen Livescu (TTI-Chicago) |
415 | Summarize while Translating: Universal Model with Parallel Decoding for Summarization and Translation | Takatomo Kano (NTT Corporation)*; Atsunori Ogawa (NTT Corporation); Marc Delcroix (NTT); Kohei Matsuura (NTT); Takanori Ashihara (NTT Corp.); William Chen (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University) |
417 | Acoustic Model Fusion for End-To-End Speech Recognition | Zhihong Lei (Apple); Mingbin Xu (Apple Inc.)*; Shiyi Han (Apple); Leo Liu (Apple); Zhen Huang (Apple); Tim Ng (Apple); Yuanyuan Zhang (Apple); Ernest Pusateri (Apple Inc.); Mirko Hannemann (Apple); Yaqiao Deng (Apple); Man-Hung Siu (Apple) |
420 | Domain Adaptation by Data Distribution Matching via Submodularity for Speech Recognition | Yusuke Shinohara (Yahoo Japan Corporation)*; Shinji Watanabe (Carnegie Mellon University) |
422 | The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2Met 2.0): A Benchmark for Speaker-Attributed ASR | Yuhao Liang (Northwestern Polytechnical University)*; Mohan Shi (University of Science and Technology of China); Fan Yu (Speech Lab of DAMO Academy, Alibaba Group); Yangze Li (Northwestern Polytechnical University); Shiliang Zhang (Alibaba Group); Zhihao Du (Speech Lab of DAMO Academy, Alibaba Group); Lei Xie (Northwestern Polytechnical University); Yanmin Qian (Shanghai Jiao Tong University); Jian Wu (Microsoft); Zhuo Chen (Microsoft); Kong Aik Lee (ICT Cluster, Singapore Institute of Technology); Zhijie Yan (Alibaba Inc.); Hui Bu (AISHELL) |
423 | SLM: Bridging The Thin Gap Between Speech and Text Foundational Models | Mingqiu Wang (Google Inc)*; Wei Han (Google); Izhak Shafran (Google AI); Zelin Wu (Google LLC); Chung-Cheng Chiu (Google); Yuan Cao (Google Brain); Nanxin Chen (Google); Yu Zhang (Google); Hagen Soltau (Google); Paul Rubenstein (Google); Lucas Zilka (Google); Dian Yu (Google); Golan Pundak (Google); Nikhil Siddhartha (Google.com); Johan Schalkwyk (Google); Yonghui Wu (Google) |
424 | An Exploration of Task-decoupling on Two-stage Neural Post Filter for Real-time Personalized Acoustic Echo Cancellation | Zihan Zhang (Northwestern Polytechnical University)*; Jiayao Sun (Northwestern Polytechnical University); Xianjun Xia (RTC Lab, ByteDance); Ziqian Wang (Northwestern Polytechnical University); Xiaopeng Yan (Northwestern Polytechnical University); Yijian Xiao (ByteDance); Lei Xie (Northwestern Polytechnical University) |
425 | Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation | Zhaofeng Lin (Multimedia Computing Group, Delft University of Technology); Tanvina Patel (Multimedia computing, Delft University of Technology); Odette Scharenborg (Multimedia Computing Group, Delft University of Technology)* |
426 | Leveraging The Multilingual Indonesian Ethnic Languages Dataset in Self-Supervised Model for Low-Resource ASR Task | Sakriani Sakti (Japan Advanced Institute of Science and Technology)*; Benita Angela Titalim (JAIST) |
429 | PromptSpeaker: Speaker Generation Based on Text Descriptions | yongmao zhang (Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China)*; Guanghou Liu (Northwestern Polytechnical University); Yi Lei (Northwestern Polytechnical University); Yunlin Chen (mobvoi); Hao Yin (mobvoi); Lei Xie (NWPU); Zhifei Li (Mobvoi) |
430 | HiGNN-TTS : Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS | Dake Guo (Northwestern Polytechnical University)*; Xinfa Zhu (Northwestern Polytechnical University); Liumeng Xue (The Chinese University of Hong Kong, Shenzhen); Tao Li (School of Computer Science, Northwestern Polytechnical University, Xi’an); Yuanjun Lv (Northwestern Polytechnical University); Yuepeng Jiang (Northwestern Polytechnical University); Lei Xie (NWPU) |
433 | Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis | Yuke Li (Audio, Speech and Language Processing Group (ASLP@NPU))*; Xinfa Zhu (Northwestern Polytechnical University); Yi Lei (Northwestern Polytechnical University); Hai Li (iQIYI Inc); Junhui Liu (iQIYI Inc); Danming Xie (iQIYI); Lei Xie (NWPU) |
435 | VITS-Based Singing Voice Conversion Leveraging Whisper and multi-scale F0 Modeling | Ziqian Ning (Northwestern Polytechnical University)*; Yuepeng Jiang (Northwestern Polytechnical University); Bin Zhang (Tencent Music Entertainment Group(TME)); Lei Xie (NWPU); Zhichao Wang (Northwestern Polytechnical University) |
436 | SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation | Yuanjun Lv (Northwestern Polytechnical University)*; Jixun Yao (Northwestern Polytechnical University); Peikun Chen (Northwestern Polytechnical University); Hongbin Zhou (Ximalaya Inc.); Heng Lu (Ximalaya Inc.); Lei Xie (Northwestern Polytechnical University) |
440 | MUST: A Multilingual Student-Teacher Learning Approach for Low-Resource Speech Recognition | Muhammad Umar Farooq (University of Sheffield)*; Rehan Ahmad (University of Sheffield); Thomas Hain (University of Sheffield) |
441 | WaveNeXt: ConvNeXt-based fast neural vocoder without iSTFT layer | Takuma Okamoto (National Institute of Information and Communications Technology)*; Haruki Yamashita (Kobe University); Yamato Ohtani (National Institute of Information and Communications Technology); Tomoki Toda (Nagoya University); Hisashi Kawai (NICT) |
445 | Spectral Tilt May Have a Smaller Impact on The Intelligibility of Speech in Noise | Yoshiki Sato (University of Aizu)*; Julián Villegas (University of Aizu) |
447 | H_Eval: A New Hybrid Evaluation Metric For Automatic Speech Recognition Tasks | Zitha Sasindran (Indian Institute of Science)*; Harsha Yelchuri (Information Science Engineering RV College of Engineering Bengaluru, India); Prabhakar Venkata Tamma (Electronics Systems Engg); Supreeth Rao (Indian Institute of Science) |
453 | Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments | Anusha Prakash (Indian Institute of Technology Madras)*; S Umesh (IIT Chennai); Hema A Murthy (IIT Madras) |
456 | Parameter-Efficient Tuning with Adaptive Bottlenecks for Automatic Speech Recognition | Geoffroy Vanderreydt (IDLab)*; Amrutha Prasad (Idiap Research Institute); Srikanth Madikeri (Idiap); Driss Khalil (Idiap Research Institute); Kris Demuynck (Ghent University); Petr Motlicek (Idiap) |
468 | Semi-Supervised Multi-Channel Speaker Diarization with Cross-Channel Attention | Shilong Wu (University of Science and Technology of China)*; Jun Du (University of Science and Technology of China); Mao-Kui He (University of Science and Technology of China); Shutong Niu (University of Science and Technology of China); Hang Chen (USTC); Haitao Tang (iFLYTEK Research); Chin-hui Lee (Georgia Institute of Technology) |
473 | Gated Multi Encoders and Multitask Objectives For Dialectal Speech Recognition in Indian Languages | Sathvik Udupa (Indian Institute of Science)*; Jesuraj Bandekar (IISc); Deekshitha G (IISc); Saurabh Kumar (IISc Bengaluru); Prasanta Ghosh (); Sandhya Badiger (IISc Bangalore); Abhayjeet Singh (Indian Institute of Sciences, Bangalore, India); Savitha S Murthy (IISc); Priyanka Pai (Navana Tech, Mumbai); Srinivasa Raghavan (Navana Tech, Mumbai); Rohan Saxena (Navana Tech, Mumbai) |
478 | VITS-Based Singing Voice Conversion System with DSPGAN Post-Processing for SVCC2023 | yiquan zhou (xjtu)*; Chen Meng (TME); Yi Lei (Northwestern Polytechnical University); Jihua Zhu (Xi'an Jiaotong University); weifeng zhao (tencent) |